Assessing and Remedying Coverage for a Given Dataset
Data analysis impacts virtually every aspect of our society today. Often,
this analysis is performed on an existing dataset, possibly collected through a
process that the data scientists had limited control over. The existing data
analyzed may not include the complete universe, but it is expected to cover the
diversity of items in the universe. Lack of adequate coverage in the dataset
can result in undesirable outcomes such as biased decisions and algorithmic
racism, as well as creating vulnerabilities such as opening up room for
adversarial attacks.
In this paper, we assess the coverage of a given dataset over multiple
categorical attributes. We first provide efficient techniques for traversing
the combinatorial explosion of value combinations to identify any regions of
attribute space not adequately covered by the data. Then, we determine the
least amount of additional data that must be obtained to resolve this lack of
adequate coverage. We confirm the value of our proposal through both
theoretical analyses and comprehensive experiments on real data.
Comment: in ICDE 201
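The idea of finding regions of attribute space that the data does not cover can be illustrated with a brute-force sketch. This is not the efficient traversal the abstract refers to; it enumerates every pattern and is exponential in the number of attributes. The names `WILDCARD`, `coverage`, and `uncovered_patterns`, the toy rows, and the thresholds are all assumptions made for illustration.

```python
from itertools import product

WILDCARD = None  # hypothetical marker meaning "any value of this attribute"

def coverage(pattern, rows):
    """Count the rows that match a pattern; a wildcard matches every value."""
    return sum(
        all(p is WILDCARD or p == v for p, v in zip(pattern, row))
        for row in rows
    )

def uncovered_patterns(rows, domains, threshold):
    """Return the most general patterns whose coverage is below threshold.

    A pattern is reported only if each of its generalizations (one more
    attribute replaced by a wildcard) is itself adequately covered.
    """
    result = []
    for pattern in product(*[[WILDCARD] + list(d) for d in domains]):
        if coverage(pattern, rows) >= threshold:
            continue
        parents = [
            pattern[:i] + (WILDCARD,) + pattern[i + 1:]
            for i, p in enumerate(pattern)
            if p is not WILDCARD
        ]
        if all(coverage(q, rows) >= threshold for q in parents):
            result.append(pattern)
    return result

# Toy dataset over two categorical attributes (made-up values).
rows = [("F", "white"), ("F", "black"), ("M", "white"), ("M", "white")]
domains = [("F", "M"), ("white", "black")]
gaps = uncovered_patterns(rows, domains, threshold=1)
```

With a threshold of 1, the only reported gap in this toy data is the combination `("M", "black")`, which no row matches.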
Online Maximum Independent Set of Hyperrectangles
The maximum independent set problem is a classical NP-hard problem in
theoretical computer science. In this work, we study a special case where the
family of graphs considered is restricted to intersection graphs of sets of
axis-aligned hyperrectangles and the input is provided in an online fashion. We
prove bounds on the competitive ratio of an optimal online algorithm under the
adaptive offline, adaptive online, and oblivious adversary models, for several
classes of hyperrectangles and restrictions on the order of the input.
We are the first to present results on this problem under the oblivious
adversary model. We prove bounds on the competitive ratio for unit hypercubes,
$\sigma$-bounded hypercubes, unit-volume hypercubes, arbitrary hypercubes, and
arbitrary hyperrectangles, in both arbitrary and non-dominated order. We are
also the first to present results under the adaptive offline and adaptive
online adversary models with input in non-dominated order, proving bounds on
the competitive ratio for the same classes of hyperrectangles; for input in
arbitrary order, we present the first results on $\sigma$-bounded hypercubes,
unit-volume hyperrectangles, arbitrary hypercubes, and arbitrary
hyperrectangles. For input in dominating order, we show that the performance of
the naive greedy algorithm matches the performance of an optimal offline
algorithm in all cases. We also give lower bounds on the competitive ratio of a
probabilistic greedy algorithm under the oblivious adversary model. We conclude
by discussing several promising directions for future work.
Comment: 27 pages, 12 figures
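The naive greedy algorithm mentioned in the abstract can be sketched as follows: each hyperrectangle is accepted on arrival if it intersects nothing accepted so far, and the decision is never revised. The representation (one closed `(lo, hi)` interval per dimension), the function names, and the sample stream are assumptions for illustration; treating boxes that merely touch as intersecting is also an assumption.

```python
def intersects(a, b):
    """Two axis-aligned boxes intersect iff their intervals overlap in every dimension."""
    return all(
        lo1 <= hi2 and lo2 <= hi1
        for (lo1, hi1), (lo2, hi2) in zip(a, b)
    )

def greedy_online(stream):
    """Accept each arriving box iff it is disjoint from all accepted boxes."""
    accepted = []
    for box in stream:
        if not any(intersects(box, kept) for kept in accepted):
            accepted.append(box)
    return accepted

# Three 2-D boxes arriving one at a time (made-up coordinates).
stream = [
    ((0, 2), (0, 2)),
    ((1, 3), (1, 3)),  # overlaps the first box, so it is rejected
    ((3, 4), (0, 1)),  # disjoint from the first box, so it is accepted
]
independent = greedy_online(stream)
```

The returned set is always independent, but its size depends on the arrival order, which is what the competitive ratio measures.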
Data-Centric Distrust Quantification for Responsible AI: When Data-driven Outcomes Are Not Reliable
At the same time that AI and machine learning are becoming central to human
life, their potential harms become more vivid. In the presence of such
drawbacks, a critical question one needs to address before using these
data-driven technologies to make a decision is whether to trust their outcomes.
Aligned with recent efforts on data-centric AI, this paper proposes a novel
approach to address the trust question through the lens of data, by associating
data sets with a distrust quantification that specifies their scope of use for
predicting future query points. The distrust values raise warning signals when
a prediction based on a dataset is questionable and are valuable alongside
other techniques for trustworthy AI. We propose novel algorithms for computing
the distrust values in the neighborhood of a query point efficiently and
effectively. Learning the necessary components of the measures from the data
itself, our sub-linear algorithms scale to very large and multi-dimensional
settings. Besides demonstrating the efficiency of our algorithms, our extensive
experiments reflect a consistent correlation between distrust values and model
performance. This underscores the message that when the distrust value of a
query point is high, the prediction outcome should be discarded or at least not
considered for critical decisions.
Responsible Scoring Mechanisms Through Function Sampling
Human decision-makers often receive assistance from data-driven algorithmic
systems that provide a score for evaluating objects, including individuals. The
scores are generated by a function (mechanism) that takes a set of features as
input and generates a score. The scoring functions are either machine-learned or
human-designed and can be used for different decision purposes such as ranking
or classification.
Given the potential impact of these scoring mechanisms on individuals' lives
and on society, it is important to make sure these scores are computed
responsibly. Hence we need tools for responsible scoring mechanism design. In
this paper, focusing on linear scoring functions, we highlight the importance
of unbiased function sampling and perturbation in the function space for
devising such tools. We provide unbiased samplers for the entire function
space, as well as a $\theta$-vicinity around a given function.
We then illustrate the value of these samplers for designing effective
algorithms in three diverse problem scenarios in the context of ranking.
Finally, as a fundamental method for designing responsible scoring mechanisms,
we propose a novel approach for approximating the construction of the
arrangement of hyperplanes. Despite the exponential complexity of an
arrangement in the number of dimensions, using function sampling, our algorithm
is linear in the number of samples and hyperplanes, and independent of the
number of dimensions.
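One standard way to sample linear scoring functions without bias is to draw weight vectors uniformly from the non-negative part of the unit sphere: take absolute values of independent Gaussian draws and normalize. The sketch below uses that technique; it is not necessarily the sampler proposed in the paper, and the function names, the toy items, and the sample count are assumptions.

```python
import math
import random

def sample_weights(d, rng):
    """Draw a weight vector uniformly from the non-negative unit sphere in d dimensions."""
    g = [abs(rng.gauss(0.0, 1.0)) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def top_item(items, w):
    """Index of the item with the highest linear score under weights w."""
    scores = [sum(wi * xi for wi, xi in zip(w, item)) for item in items]
    return max(range(len(items)), key=scores.__getitem__)

# Count how often each item ranks first across sampled scoring functions.
rng = random.Random(0)
items = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4)]  # made-up feature vectors
top_counts = [0] * len(items)
for _ in range(1000):
    top_counts[top_item(items, sample_weights(2, rng))] += 1
```

In this toy example the third item can never rank first under any non-negative weights, since its score 0.4(w1 + w2) is always below max(w1, w2), so its count stays at zero. Counting over samples in this way gives a simple estimate of how sensitive a ranking is to the choice of weights.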
Efficient Computation of Subspace Skyline over Categorical Domains
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed
the way we search for accommodation, restaurants, etc. The underlying datasets
in such applications have numerous attributes that are mostly Boolean or
Categorical. Discovering the skyline of such datasets over a subset of
attributes would identify entries that stand out while enabling numerous
applications. There are only a few algorithms designed to compute the skyline
over categorical attributes, and they are applicable only when the number of
attributes is small.
In this paper, we place the problem of skyline discovery over categorical
attributes into perspective and design efficient algorithms for two cases. (i)
In the absence of indices, we propose two algorithms, ST-S and ST-P, that
exploit the categorical characteristics of the datasets, organizing tuples in
a tree data structure, supporting efficient dominance tests over the candidate
set. (ii) We then consider the existence of widely used precomputed sorted
lists. After discussing several approaches, and studying their limitations, we
propose TA-SKY, a novel threshold style algorithm that utilizes sorted lists.
Moreover, we further optimize TA-SKY and explore its progressive nature, making
it suitable for applications with strict interactive requirements. In addition
to the extensive theoretical analysis of the proposed algorithms, we conduct a
comprehensive experimental evaluation on a combination of real (including the
entire AirBnB data collection) and synthetic datasets to study the practicality
of the proposed algorithms. The results showcase the superior performance of
our techniques, outperforming applicable approaches by orders of magnitude.
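For reference, the subspace skyline itself can be defined with a naive quadratic baseline: a tuple is in the skyline over a chosen subset of attributes if no other tuple dominates it on that subset. This sketch is not ST-S, ST-P, or TA-SKY. The function names and the Boolean listing data are assumptions, as is the convention that a larger value is preferred on every attribute.

```python
def dominates(a, b, attrs):
    """a dominates b on attrs: at least as good everywhere, strictly better somewhere."""
    return (
        all(a[i] >= b[i] for i in attrs)
        and any(a[i] > b[i] for i in attrs)
    )

def subspace_skyline(tuples, attrs):
    """Tuples not dominated by any other tuple on the given attribute subset."""
    return [
        t for t in tuples
        if not any(dominates(o, t, attrs) for o in tuples)
    ]

# Made-up listings with Boolean attributes (wifi, parking, kitchen).
listings = [(1, 0, 1), (1, 1, 0), (0, 0, 1), (1, 0, 0)]
wifi_parking = subspace_skyline(listings, attrs=[0, 1])
parking_kitchen = subspace_skyline(listings, attrs=[1, 2])
```

Every tuple is compared against every other tuple, so the cost grows quadratically with the number of tuples.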
A Fair and Memory/Time-efficient Hashmap
There is a large amount of work constructing hashmaps to minimize the number
of collisions. However, to the best of our knowledge no known hashing technique
guarantees group fairness among different groups of items. We are given a set
of tuples in $\mathbb{R}^d$, for a constant dimension $d$, and a set of groups
such that every tuple belongs to a unique group. We formally define the fair
hashing problem, introducing the notions of single fairness, pairwise fairness,
and the well-known collision probability. The goal is to construct a hashmap
such that the collision probability, the single fairness, and the pairwise
fairness are close to $1/m$, where $m$ is the number of buckets in the hashmap.
We propose two families of algorithms to design fair hashmaps. First, we
focus on hashmaps with optimum memory consumption minimizing the unfairness. We
model the input tuples as points in $\mathbb{R}^d$ and the goal is to find a
vector such that the projection of the points onto it creates an ordering that
is convenient to split to create a fair hashmap. For each projection we design
efficient algorithms that find near optimum partitions of exactly (or at most)
$m$ buckets. Second, we focus on hashmaps with optimum fairness
($0$-unfairness), minimizing the memory consumption. We make the important
observation that the fair hashmap problem reduces to the necklace splitting
problem. By carefully implementing algorithms for solving the necklace
splitting problem, we propose faster algorithms constructing hashmaps with
$0$-unfairness using boundary points when and boundary points for
Maximizing Neutrality in News Ordering
The detection of fake news has received increasing attention over the past
few years, but there are more subtle ways of deceiving one's audience. In
addition to the content of news stories, their presentation can also be made
misleading or biased. In this work, we study the impact of the ordering of news
stories on audience perception. We introduce the problems of detecting
cherry-picked news orderings and maximizing neutrality in news orderings. We
prove hardness results and present several algorithms for approximately solving
these problems. Furthermore, we provide extensive experimental results and
present evidence of potential cherry-picking in the real world.
Comment: 14 pages, 13 figures, accepted to KDD '2